BrewDog Analysis

This notebook systematically performs weak signal analysis on BrewDog locations in the UK in order to create a composite measure capable of predicting which LSOAs in the UK would support a new BrewDog location.

Workspace Setup

In this section we setup the workspace we will use within this notebook - the python libraries, and the local directory contianing the data inputs and outputs produced over the course of this analysis.

Dependencies

In [1]:
import datalab
from tensorflow.python.lib.io import file_io
import json
import math
import matplotlib.pyplot as plot
import numpy as np
import pandas as pd
import random
import os
import seaborn as sns
import sklearn.metrics as metrics
from sklearn.decomposition import PCA
from google.datalab import Context
import google.datalab.storage as storage

Local Development

The local development workspace will be in /content/datalab/workspace/brewdog.

In [2]:
workspace_path = '/content/datalab/workspace/brewdog'
In [3]:
!mkdir -p {workspace_path}

Cleaning

If you have previously run this notebook, and want to start from scratch, then run the next cell to delete and create the workspace directory.

In [4]:
!rm -rf {workspace_path} && mkdir {workspace_path}

Loading Dataset

In order to begin our anlysis, we need to load the UK Master Dataset that we produced using our virtaulized data layer (DataPrep) from Cloud Storage.

In [5]:
!gsutil -q cp gs://princeton-thesis-data/uk-master/uk-master-data.csv {workspace_path}/data/uk-master-data.csv
In [6]:
!ls -l {workspace_path}/data
total 29564
-rw-r--r-- 1 root root 30273058 Mar  4 18:20 uk-master-data.csv

Next, create Pandas dataframe from the dataseat we just loaded in.

In [7]:
df_data = pd.read_csv(os.path.join(workspace_path, 'data/uk-master-data.csv'), dtype=str)
print '%d rows' % len(df_data)
df_data.describe()
34916 rows
Out[7]:
geography_code Economic_Activity_Economically_active_Total_measures_Value Economic_Activity_Economically_active_Employee_Part_time_measures_Value Economic_Activity_Economically_active_Employee_Full_time_measures_Value Economic_Activity_Economically_active_Self_employed_with_employees_Part_time_measures_Value Economic_Activity_Economically_active_Unemployed_measures_Value Economic_Activity_Economically_active_Full_time_student_measures_Value Economic_Activity_Economically_inactive_Total_measures_Value Economic_Activity_Economically_inactive_Retired_measures_Value Economic_Activity_Economically_inactive_Student_including_full_time_students_measures_Value ... Age_Age_94_measures_Value Age_Age_95_measures_Value Age_Age_96_measures_Value Age_Age_97_measures_Value Age_Age_98_measures_Value Age_Age_99_measures_Value Age_Age_100_and_over_measures_Value Area_Population_Density_All_usual_residents_measures_Value Area_Population_Density_Area_Hectares Area_Population_Density_Density_number_of_persons_per_hectare
count 34916 34916 34916 34916 34916 34916 34916 34916 34916 34916 ... 34916 34916 34916 34916 34916 34916 34916 34916 34916 34916
unique 34753 1244 330 983 33 222 459 940 484 741 ... 19 16 14 12 10 8 12 1841 17516 2114
top E01032739 743 164 422 2 29 29 323 132 35 ... 0 0 0 0 0 0 0 1481 24.60 0.4
freq 13 118 481 157 5295 733 1310 214 251 861 ... 16572 19256 21402 24428 27116 29328 27023 92 19 412

4 rows × 174 columns

Data Transformations

In order to perform weak signal analysis, we must standardize, vectorize, and normalize the dataset. This section carries out each of those steps. We also generate a random artificial dataset to use to verify that the correlations we discover are meaningful.

Standardize

In [8]:
# Standardize to Percentage of Population Indicators
def standardize_data(df):
  column_names = list(df)
  
  population_count_columns = [col for col in column_names if 'population_count' in col]
  total_population_column_name = 'Area_Population_Density_All_usual_residents_population_count'
  total_population_column = df[total_population_column_name]
  total_population_index = population_count_columns.index(total_population_column_name)
  del population_count_columns[total_population_index] # remove total population count from columns
  
  for col in population_count_columns:
    new_name = col.replace('population_count', 'percentage_of_population')
    df[new_name] = df[col] / total_population_column
    del df[col]
    
  return df

Vectorize

In [9]:
# Vectorize data based on analysis of indicators
def vectorize_data(df):
  # uncomment indicators to add to negative vectorization list
  negative_vector_columns = [
                            # 'Economic_Activity_Economically_active_Employee_Part_time_percentage_of_population',
                            # 'Economic_Activity_Economically_active_Employee_Full_time_percentage_of_population',
                            # 'Economic_Activity_Economically_active_Self_employed_with_employees_Part_time_percentage_of_population',
                            # 'Economic_Activity_Economically_active_Unemployed_percentage_of_population',
                            # 'Economic_Activity_Economically_active_Full_time_student_percentage_of_population',
                            'Economic_Activity_Economically_inactive_Total_percentage_of_population',
                            'Economic_Activity_Economically_inactive_Retired_percentage_of_population',
                            'Economic_Activity_Economically_inactive_Student_including_full_time_students_percentage_of_population',
                            'Economic_Activity_Economically_inactive_Long_term_sick_or_disabled_percentage_of_population',
                            # 'Qualification_All_categories_Highest_level_of_qualification_percentage_of_population',
                            'Qualification_No_qualifications_percentage_of_population',
                            # 'Distance_travelled_to_work_Less_than_2km_percentage_of_population',
                            # 'Method_of_Travel_to_Work_Underground_metro_light_rail_tram_percentage_of_population',
                            # 'Method_of_Travel_to_Work_Train_percentage_of_population',
                            # 'Method_of_Travel_to_Work_Bus_minibus_or_coach_percentage_of_population',
                            # 'Method_of_Travel_to_Work_Taxi_percentage_of_population',
                            # 'Method_of_Travel_to_Work_Motorcycle_scooter_or_moped_percentage_of_population',
                            'Method_of_Travel_to_Work_Driving_a_car_or_van_percentage_of_population',
                            'Method_of_Travel_to_Work_Passenger_in_a_car_or_van_percentage_of_population',
                            # 'Method_of_Travel_to_Work_Bicycle_percentage_of_population',
                            # 'Method_of_Travel_to_Work_On_foot_percentage_of_population',
                            # 'Sex_Males_percentage_of_population',
                            'Sex_Females_percentage_of_population',
                            # 'Proficiency_in_English_Main_language_is_English_English_or_Welsh_in_Wales_percentage_of_population',
                            # 'Religion_Christian_percentage_of_population',
                            'Religion_Buddhist_percentage_of_population',
                            'Religion_Hindu_percentage_of_population',
                            'Religion_Jewish_percentage_of_population',
                            'Religion_Muslim_percentage_of_population',
                            'Religion_No_religion_percentage_of_population',
                            # 'Hours_Worked_Part_time_15_hours_or_less_worked_percentage_of_population',
                            # 'Hours_Worked_Part_time_16_to_30_hours_worked_percentage_of_population',
                            # 'Hours_Worked_Full_time_31_to_48_hours_worked_percentage_of_population',
                            # 'Hours_Worked_Full_time_49_or_more_hours_worked_percentage_of_population',
                            # 'Cars_No_cars_or_vans_in_household_percentage_of_population',
                            'Cars_sum_of_All_cars_or_vans_in_the_area_percentage_of_population',
                            # 'Ethnic_Group_White_percentage_of_population',
                            'Ethnic_Group_Mixed_percentage_of_population',
                            'Ethnic_Group_Asian_percentage_of_population',
                            'Ethnic_Group_Black_percentage_of_population',
                            'Ethnic_Group_Other_percentage_of_population',
                            # 'Industry_A_Agriculture_forestry_and_fishing_percentage_of_popu,lation',
                            # 'Industry_B_Mining_and_quarrying_percentage_of_population',
                            # 'Industry_C_Manufacturing_percentage_of_population',
                            # 'Industry_D_Electricity_gas_steam_and_air_conditioning_supply_percentage_of_population',
                            # 'Industry_E_Water_supply_sewerage_waste_management_and_remediation_activities_percentage_of_population',
                            # 'Industry_F_Construction_percentage_of_population',
                            # 'Industry_G_Wholesale_and_retail_trade_repair_of_motor_vehicles_and_motor_cycles_percentage_of_population',
                            # 'Industry_H_Transport_and_storage_percentage_of_population',
                            # 'Industry_I_Accommodation_and_food_service_activities_percentage_of_population',
                            # 'Industry_J_Information_and_communication_percentage_of_population',
                            # 'Industry_K_Financial_and_insurance_activities_percentage_of_population',
                            # 'Industry_L_Real_estate_activities_percentage_of_population',
                            # 'Industry_M_Professional_scientific_and_technical_activities_percentage_of_population',
                            # 'Industry_N_Administrative_and_support_service_activities_percentage_of_population',
                            # 'Industry_O_Public_administration_and_defence_compulsory_social_security_percentage_of_population',
                            # 'Industry_P_Education_percentage_of_population',
                            # 'Industry_Q_Human_health_and_social_work_activities_percentage_of_population',
                            # 'Industry_R_S_Arts_entertainment_and_recreation_other_service_activities_percentage_of_population',
                            # 'Industry_T_Activities_of_households_as_employers_undifferentiated_goods_and_services_producing_activities_of_households_for_own_use_percentage_of_population',
                            # 'Industry_U_Activities_of_extraterritorial_organisations_and_bodies_percentage_of_population',
                            # 'General_Health_Very_good_health_percentage_of_population',
                            # 'General_Health_Good_health_percentage_of_population',
                            # 'General_Health_Fair_health_percentage_of_population',
                            'General_Health_Bad_health_percentage_of_population',
                            'General_Health_Very_bad_health_percentage_of_population',
                            # 'Age_0_to_17_percentage_of_population',
                            # 'Age_18_to_30_percentage_of_population',
                            # 'Age_30_40_percentage_of_population',
                            # 'Age_40_50_percentage_of_population',
                            # 'Age_50_60_percentage_of_population',
                            # 'Age_60_70_percentage_of_population',
                            # 'Age_70_80_percentage_of_population',
                            # 'Age_80_90_percentage_of_population',
                            # 'Age_90_100_percentage_of_population'
                            ]
  
  for col in negative_vector_columns:
    df[col] = df[col].apply(lambda x: x * -1)
      
  return df

Normalize

In [10]:
# Normalize data using Z-scores
def normalize_data(df):
  column_names = list(df)

  # Get numeric columns to normalize
  numeric_columns = [col for col in column_names if (col != 'geography_code' and 
                                                     col != 'BrewDog' and 
                                                     col != 'Starbucks' and 
                                                     col != 'Binomial_Artificial_Data' and 
                                                     col != 'Area_Population_Density_All_usual_residents_population_count' and
                                                     col != 'Area_Population_Density_Area_Hectares' and
                                                     col != 'Area_Population_Density_Density_number_of_persons_per_hectare')]

  # Normalize each column
  for col in numeric_columns:
    col_zscore = col + '_z_score' # name of normalized column
    #compute the zscore
    df[col_zscore] = (df[col] - df[col].mean())/df[col].std(ddof=0)
    del df[col] # delete the original column (not normalized)

  return df

Bucket Ages

In [11]:
# Create composite age indicators (0-17, 18-30, 30-40, 40-50, 50-60, 60-70...)
def bucket_ages(df):
  column_names = list(df)
  del df['Age_Age_100_and_over_measures_Value'] # remove 100 and over
  age_columns = [col for col in column_names if 'Age_Age_' in col]
  
  ## 0 to 17
  age_0_to_17_columns = age_columns[0:17]
  df['Age_0_to_17_population_count'] = df[age_0_to_17_columns].sum(axis=1)
  for col in age_0_to_17_columns:
    del df[col]
    
  ## 18 to 30   
  age_18_to_30_columns = age_columns[17:30]
  df['Age_18_to_30_population_count'] = df[age_18_to_30_columns].sum(axis=1)
  for col in age_18_to_30_columns:
    del df[col]
    
  ## Rest of the age groups
  base_age = 30 # comes from custom groupings above
  for x in range(0,7):
    low = base_age + (x * 10)
    high = low + 10
    age_group_columns = age_columns[low:high]
    column_name = 'Age_' + str(low) + '_' + str(high) + '_population_count'
    df[column_name] = df[age_group_columns].sum(axis=1)
    for col in age_group_columns:
      del df[col]
      
  return df

Artificial Data

In [12]:
# Create Column for Artificial Data by Binomial Distribution to show that correlations are valid
def generate_artifical_data(df):
  number_of_brew_dogs = (df['BrewDog'] == 1).sum()
  number_of_areas = len(df.index)
  probability_of_brew_dog = float(number_of_brew_dogs) / float(number_of_areas)
  df['Binomial_Artificial_Data'] = np.random.binomial(1, probability_of_brew_dog, number_of_areas)
  
  return df

Transform

In [13]:
# Optionally person standardization, vectorization, and normalization on a dataframe
def transform_data(df_original, standardize, vectorize, normalize):
  df = df_original.copy(deep=True)
  
  column_names = list(df)
  
  # Convert the columns to the correct datatype (for the UK all columns are numerical except geography_code so need to filter that out)
  numeric_columns = [col for col in column_names if col != 'geography_code']
  df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric)
  
  # Create Age Buckets
  df = bucket_ages(df)
    
  # Rename columns with more descriptive name 
  df.columns = df.columns.str.replace('measures_Value', 'population_count')
  column_names = list(df)
  
  # Standardization
  if standardize:
    df = standardize_data(df)
      
  # Artifical Data
  df = generate_artifical_data(df)
  
  # Move Columns to the front for correlation matrices    
  column_names = list(df)
  column_names.insert(0, column_names.pop(column_names.index('Starbucks')))
  column_names.insert(0, column_names.pop(column_names.index('Binomial_Artificial_Data')))
  column_names.insert(0, column_names.pop(column_names.index('BrewDog')))
    
  df = df.loc[:, column_names]
  
  # Vectorization
  if (vectorize):
    df = vectorize_data(df)
      
  # Normalization 
  if (normalize):
    df = normalize_data(df)
  
  return df 

Standardized Data

In [14]:
df_data_s = transform_data(df_data, True, False, False) 
print '%d rows' % len(df_data_s)
df_data_s.describe()
34916 rows
Out[14]:
BrewDog Binomial_Artificial_Data Starbucks Area_Population_Density_All_usual_residents_population_count Area_Population_Density_Area_Hectares Area_Population_Density_Density_number_of_persons_per_hectare Economic_Activity_Economically_active_Total_percentage_of_population Economic_Activity_Economically_active_Employee_Part_time_percentage_of_population Economic_Activity_Economically_active_Employee_Full_time_percentage_of_population Economic_Activity_Economically_active_Self_employed_with_employees_Part_time_percentage_of_population ... General_Health_Very_bad_health_percentage_of_population Age_0_to_17_percentage_of_population Age_18_to_30_percentage_of_population Age_30_40_percentage_of_population Age_40_50_percentage_of_population Age_50_60_percentage_of_population Age_60_70_percentage_of_population Age_70_80_percentage_of_population Age_80_90_percentage_of_population Age_90_100_percentage_of_population
count 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 ... 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000
mean 0.000802 0.000916 0.021509 1614.092880 436.157968 41.453110 0.511712 0.101285 0.283004 0.002762 ... 0.012831 0.199888 0.171420 0.131132 0.146778 0.122360 0.109727 0.071751 0.039203 0.007540
std 0.028307 0.030260 0.145075 306.433578 1471.873554 41.760838 0.066511 0.018947 0.066003 0.002034 ... 0.007120 0.049664 0.083689 0.042418 0.024105 0.027977 0.041037 0.031398 0.021837 0.006628
min 0.000000 0.000000 0.000000 983.000000 1.830000 0.000000 0.151662 0.009115 0.031820 0.000000 ... 0.000000 0.005413 0.039291 0.023608 0.008553 0.000684 0.000342 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 1434.000000 27.730000 11.200000 0.471735 0.091298 0.243161 0.001323 ... 0.007701 0.170903 0.123288 0.102402 0.133787 0.104975 0.079834 0.049146 0.023536 0.003127
50% 0.000000 0.000000 0.000000 1563.000000 47.310000 33.300000 0.511782 0.102789 0.280027 0.002342 ... 0.011425 0.197412 0.156105 0.124651 0.147834 0.123440 0.107857 0.069058 0.036334 0.005775
75% 0.000000 0.000000 0.000000 1735.000000 142.255000 56.100000 0.550727 0.113469 0.319119 0.003755 ... 0.016489 0.227760 0.192686 0.151556 0.161458 0.141269 0.138206 0.091074 0.050952 0.009889
max 1.000000 1.000000 1.000000 8300.000000 67283.520000 684.700000 0.920073 0.199467 0.741144 0.019837 ... 0.097649 0.479381 0.931163 0.435685 0.264611 0.247745 0.292265 0.292929 0.247717 0.084204

8 rows × 82 columns

Standardized and Vectorized Data

In [15]:
df_data_st = transform_data(df_data, True, True, False) 
print '%d rows' % len(df_data_st)
df_data_st.describe()
34916 rows
Out[15]:
BrewDog Binomial_Artificial_Data Starbucks Area_Population_Density_All_usual_residents_population_count Area_Population_Density_Area_Hectares Area_Population_Density_Density_number_of_persons_per_hectare Economic_Activity_Economically_active_Total_percentage_of_population Economic_Activity_Economically_active_Employee_Part_time_percentage_of_population Economic_Activity_Economically_active_Employee_Full_time_percentage_of_population Economic_Activity_Economically_active_Self_employed_with_employees_Part_time_percentage_of_population ... General_Health_Very_bad_health_percentage_of_population Age_0_to_17_percentage_of_population Age_18_to_30_percentage_of_population Age_30_40_percentage_of_population Age_40_50_percentage_of_population Age_50_60_percentage_of_population Age_60_70_percentage_of_population Age_70_80_percentage_of_population Age_80_90_percentage_of_population Age_90_100_percentage_of_population
count 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 ... 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000
mean 0.000802 0.001088 0.021509 1614.092880 436.157968 41.453110 0.511712 0.101285 0.283004 0.002762 ... -0.012831 0.199888 0.171420 0.131132 0.146778 0.122360 0.109727 0.071751 0.039203 0.007540
std 0.028307 0.032972 0.145075 306.433578 1471.873554 41.760838 0.066511 0.018947 0.066003 0.002034 ... 0.007120 0.049664 0.083689 0.042418 0.024105 0.027977 0.041037 0.031398 0.021837 0.006628
min 0.000000 0.000000 0.000000 983.000000 1.830000 0.000000 0.151662 0.009115 0.031820 0.000000 ... -0.097649 0.005413 0.039291 0.023608 0.008553 0.000684 0.000342 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 1434.000000 27.730000 11.200000 0.471735 0.091298 0.243161 0.001323 ... -0.016489 0.170903 0.123288 0.102402 0.133787 0.104975 0.079834 0.049146 0.023536 0.003127
50% 0.000000 0.000000 0.000000 1563.000000 47.310000 33.300000 0.511782 0.102789 0.280027 0.002342 ... -0.011425 0.197412 0.156105 0.124651 0.147834 0.123440 0.107857 0.069058 0.036334 0.005775
75% 0.000000 0.000000 0.000000 1735.000000 142.255000 56.100000 0.550727 0.113469 0.319119 0.003755 ... -0.007701 0.227760 0.192686 0.151556 0.161458 0.141269 0.138206 0.091074 0.050952 0.009889
max 1.000000 1.000000 1.000000 8300.000000 67283.520000 684.700000 0.920073 0.199467 0.741144 0.019837 ... -0.000000 0.479381 0.931163 0.435685 0.264611 0.247745 0.292265 0.292929 0.247717 0.084204

8 rows × 82 columns

Standardized, Vectorized, and Normalized Data

In [16]:
df_data_stn = transform_data(df_data, True, True, True) 
print '%d rows' % len(df_data_st)
df_data_stn.describe()
34916 rows
Out[16]:
BrewDog Binomial_Artificial_Data Starbucks Area_Population_Density_All_usual_residents_population_count Area_Population_Density_Area_Hectares Area_Population_Density_Density_number_of_persons_per_hectare Economic_Activity_Economically_active_Total_percentage_of_population_z_score Economic_Activity_Economically_active_Employee_Part_time_percentage_of_population_z_score Economic_Activity_Economically_active_Employee_Full_time_percentage_of_population_z_score Economic_Activity_Economically_active_Self_employed_with_employees_Part_time_percentage_of_population_z_score ... General_Health_Very_bad_health_percentage_of_population_z_score Age_0_to_17_percentage_of_population_z_score Age_18_to_30_percentage_of_population_z_score Age_30_40_percentage_of_population_z_score Age_40_50_percentage_of_population_z_score Age_50_60_percentage_of_population_z_score Age_60_70_percentage_of_population_z_score Age_70_80_percentage_of_population_z_score Age_80_90_percentage_of_population_z_score Age_90_100_percentage_of_population_z_score
count 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 ... 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04
mean 0.000802 0.000659 0.021509 1614.092880 436.157968 41.453110 1.825400e-16 7.509173e-16 3.561261e-16 -2.035006e-19 ... -5.636967e-17 1.106535e-16 2.485760e-16 -1.023354e-16 4.424103e-16 1.748070e-16 -3.652836e-17 -4.126992e-16 -2.623123e-16 9.076127e-17
std 0.028307 0.025658 0.145075 306.433578 1471.873554 41.760838 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 ... 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00
min 0.000000 0.000000 0.000000 983.000000 1.830000 0.000000 -5.413433e+00 -4.864558e+00 -3.805675e+00 -1.358135e+00 ... -1.191311e+01 -3.915913e+00 -1.578829e+00 -2.534894e+00 -5.734456e+00 -4.349254e+00 -2.665573e+00 -2.285248e+00 -1.795276e+00 -1.137504e+00
25% 0.000000 0.000000 0.000000 1434.000000 27.730000 11.200000 -6.010763e-01 -5.271023e-01 -6.036662e-01 -7.076847e-01 ... -5.137678e-01 -5.836344e-01 -5.751411e-01 -6.773256e-01 -5.389705e-01 -6.214417e-01 -7.284476e-01 -7.199529e-01 -7.174496e-01 -6.657508e-01
50% 0.000000 0.000000 0.000000 1563.000000 47.310000 33.300000 1.040814e-03 7.938616e-02 -4.509922e-02 -2.065185e-01 ... 1.974068e-01 -4.986110e-02 -1.830025e-01 -1.527924e-01 4.379444e-02 3.858206e-02 -4.556616e-02 -8.575912e-02 -1.313755e-01 -2.662219e-01
75% 0.000000 0.000000 0.000000 1735.000000 142.255000 56.100000 5.865969e-01 6.430684e-01 5.471734e-01 4.881986e-01 ... 7.205289e-01 5.612153e-01 2.541142e-01 4.815028e-01 6.090181e-01 6.758942e-01 6.940035e-01 6.154506e-01 5.380000e-01 3.544514e-01
max 1.000000 1.000000 1.000000 8300.000000 67283.520000 684.700000 6.139788e+00 5.181914e+00 6.941278e+00 8.396627e+00 ... 1.802114e+00 5.627803e+00 9.078299e+00 7.179908e+00 4.888453e+00 4.481805e+00 4.448221e+00 7.044505e+00 9.548668e+00 1.156602e+01

8 rows × 82 columns

Correlation Analysis

In this section, we perform correlation analsysis on the data. We use this to eliminate redundant indicators as well as ensure that the BrewDog locations have some correlation with the indicators when compared to what is generated with the artifical data.

Note that this can be used to turn off correlation analysis during development as it takes a while to run

In [17]:
correlation_analysis_toggle = True

Visualization Function

The below function will allow us to visualize our correlation matrices.

In [21]:
# Visualize correlation matrix
def plot_correlation(data, name):
  min_value = 0
  max_value = 0

  for i in range(len(data.columns)):
    for j in range(len(data.columns)):
      if i != j:
        min_value = min(min_value, data.iloc[i, j])
        max_value = max(max_value, data.iloc[i, j])
  span = max(abs(min_value), abs(max_value))
  span = round(span + .05, 1)

  items = data.columns.tolist()
  ticks = np.arange(0.5, len(items) + 0.5)

  plot.figure(figsize = (66, 42))
  plot.pcolor(data.values, cmap = 'RdBu', vmin = -span, vmax = span)
  plot.colorbar().set_label('correlation')
  plot.xticks(ticks, items, rotation = 'vertical')
  plot.yticks(ticks, items)

  

Analysis Function

The below function will allow us to perform Spearman, Pearson, and Kendall-Tau correlation analysis all at once and visualize the results

In [22]:
def correlation_analysis(df):
  if (correlation_analysis_toggle):
    # correlate
    spearman_correlation = df.corr(method='spearman')
    pearson_correlation = df.corr(method='pearson')
    kendall_correlation = df.corr(method='kendall')
    # visualize
    plot_correlation(spearman_correlation, 'spearman')
    plot_correlation(pearson_correlation, 'pearson')
    plot_correlation(kendall_correlation, 'kendall')

Correlation Matrices After Data Transforms

This section shows correlation matrices generated after each transformation of the data in order to ensure their validity.

Standardization Correlation Matrices

In [23]:
correlation_analysis(df_data_s)
/usr/local/lib/python2.7/dist-packages/matplotlib/font_manager.py:1288: UserWarning: findfont: Font family [u'sans-serif'] not found. Falling back to Bitstream Vera Sans
  (prop.get_family(), self.defaultFamily[fontext]))
/usr/local/lib/python2.7/dist-packages/matplotlib/font_manager.py:1298: UserWarning: findfont: Could not match :family=Bitstream Vera Sans:style=normal:variant=normal:weight=400:stretch=normal:size=10.0. Returning /usr/local/lib/python2.7/dist-packages/matplotlib/mpl-data/fonts/ttf/STIXGeneral.ttf
  UserWarning)
/usr/local/lib/python2.7/dist-packages/matplotlib/font_manager.py:1298: UserWarning: findfont: Could not match :family=Bitstream Vera Sans:style=normal:variant=normal:weight=400:stretch=normal:size=11.0. Returning /usr/local/lib/python2.7/dist-packages/matplotlib/mpl-data/fonts/ttf/STIXGeneral.ttf
  UserWarning)

Standardization and Vectorization Correlation Matrices

In [24]:
correlation_analysis(df_data_st)

Standardization, Normalization, and Vectorization Correlation Matrices

In [25]:
correlation_analysis(df_data_stn)

Correlation Matrices for BrewDog Population Density Range

In this section, we look at the correlations generated by filtering LSOAs on a particular density range. The density range is determined by the clustering of BrewDogs around a certain range. In this way, we can further highlight indicator correlations with LSOAs contianing BrewDogs.

Population Density Visualization Function

In [26]:
# Given a dataframe, this function visualizes the population density distribution of LSOAs
def visualize_population_density(df):
  df_fig, df_ax = plot.subplots()
  df.hist(column='Area_Population_Density_Density_number_of_persons_per_hectare', 
                           bins=175, 
                           range=(0,350), 
                           orientation='vertical',
                           ax=df_ax)
  df_ax.set_xscale('log')
  plot.title('LSOA Population Density Distribution')
  plot.ylabel('Number of LSOAs')
  plot.xlabel('Population Density (person per hectare)')

Population Density Distribution of All LSOAs

In [27]:
visualize_population_density(df_data_stn)
/usr/local/lib/python2.7/dist-packages/matplotlib/font_manager.py:1298: UserWarning: findfont: Could not match :family=Bitstream Vera Sans:style=normal:variant=normal:weight=400:stretch=normal:size=12.0. Returning /usr/local/lib/python2.7/dist-packages/matplotlib/mpl-data/fonts/ttf/STIXGeneral.ttf
  UserWarning)

Generate BrewDog Dataframe

In this section we generate a dataframe of transformed data that only contains LSOAs with a BrewDog location.

In [28]:
# Filter dataframe to contain only LSOAs with BrewDog location
def filter_brewdog(df_original):
  df = df_original.copy(deep=True)
  return df.loc[df['BrewDog'] == 1]
In [29]:
df_data_stn_brewdog = filter_brewdog(df_data_stn)
df_data_stn_brewdog.describe()
Out[29]:
BrewDog Binomial_Artificial_Data Starbucks Area_Population_Density_All_usual_residents_population_count Area_Population_Density_Area_Hectares Area_Population_Density_Density_number_of_persons_per_hectare Economic_Activity_Economically_active_Total_percentage_of_population_z_score Economic_Activity_Economically_active_Employee_Part_time_percentage_of_population_z_score Economic_Activity_Economically_active_Employee_Full_time_percentage_of_population_z_score Economic_Activity_Economically_active_Self_employed_with_employees_Part_time_percentage_of_population_z_score ... General_Health_Very_bad_health_percentage_of_population_z_score Age_0_to_17_percentage_of_population_z_score Age_18_to_30_percentage_of_population_z_score Age_30_40_percentage_of_population_z_score Age_40_50_percentage_of_population_z_score Age_50_60_percentage_of_population_z_score Age_60_70_percentage_of_population_z_score Age_70_80_percentage_of_population_z_score Age_80_90_percentage_of_population_z_score Age_90_100_percentage_of_population_z_score
count 28.0 28.0 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 ... 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000
mean 1.0 0.0 0.607143 1539.750000 32.313214 69.035714 1.198381 -2.685781 1.185715 0.218533 ... 0.626037 -2.707890 4.313741 1.490576 -2.553718 -2.437340 -1.859123 -1.685022 -1.185628 -0.720262
std 0.0 0.0 0.497347 468.517411 16.398344 59.246136 1.601177 1.100310 1.657231 0.919229 ... 0.877127 0.905927 2.173272 1.345553 1.623643 1.056217 0.662235 0.732901 1.046901 1.139231
min 1.0 0.0 0.000000 1020.000000 4.920000 18.900000 -1.612413 -4.441062 -2.174529 -1.358135 ... -2.655358 -3.720035 0.654564 -1.433397 -5.059481 -4.204706 -2.531818 -2.285248 -1.795276 -1.137504
25% 1.0 0.0 0.000000 1170.000000 22.020000 36.650000 0.159871 -3.712330 -0.052582 -0.244735 ... 0.220747 -3.472156 2.332244 0.867901 -3.760522 -3.252314 -2.393164 -2.174174 -1.735966 -1.137504
50% 1.0 0.0 1.000000 1380.000000 28.670000 45.200000 1.108527 -2.631764 1.574108 0.131987 ... 0.788591 -2.773914 5.066814 1.539856 -3.014265 -2.712457 -2.136397 -1.980944 -1.464853 -1.005345
75% 1.0 0.0 1.000000 1954.000000 44.837500 75.550000 2.001486 -1.897623 1.988080 0.532454 ... 1.203730 -2.047220 5.877150 1.928655 -0.985077 -1.538295 -1.567390 -1.323559 -1.035473 -0.852829
max 1.0 0.0 1.000000 2706.000000 71.250000 281.900000 4.671282 0.533221 5.397528 2.016552 ... 1.597519 -1.130466 8.212988 5.306175 0.444753 -0.080808 -0.005035 0.997771 3.744583 4.802126

8 rows × 82 columns

BrewDog Population Density Distribution

In [30]:
visualize_population_density(df_data_stn_brewdog)

Population Density Distribtuion for BrewDog Middle 50%

This section shows the population density distribution for the middle 50% of BrewDogs by population density of the LSOA they are contained in. It is this data that we will run through the correlation analysis to once again check that our indicators capture a signal corresponding to BrewDog location selection

In [31]:
# This function filters a dataframe on a population density range
def filter_density(df_original, minimum, maximum):
  df = df_original.copy(deep=True)
  return df.loc[(df['Area_Population_Density_Density_number_of_persons_per_hectare'] >= minimum) & (df['Area_Population_Density_Density_number_of_persons_per_hectare'] <= maximum)]
In [32]:
# choose density range currently showing middle 50%
min_density = 36.64 # 25%
max_density = 44.84 # 75%
In [33]:
df_data_stn_brewdog_50 = filter_density(df_data_stn_brewdog, min_density, max_density)
df_data_stn_brewdog_50.describe()
Out[33]:
BrewDog Binomial_Artificial_Data Starbucks Area_Population_Density_All_usual_residents_population_count Area_Population_Density_Area_Hectares Area_Population_Density_Density_number_of_persons_per_hectare Economic_Activity_Economically_active_Total_percentage_of_population_z_score Economic_Activity_Economically_active_Employee_Part_time_percentage_of_population_z_score Economic_Activity_Economically_active_Employee_Full_time_percentage_of_population_z_score Economic_Activity_Economically_active_Self_employed_with_employees_Part_time_percentage_of_population_z_score ... General_Health_Very_bad_health_percentage_of_population_z_score Age_0_to_17_percentage_of_population_z_score Age_18_to_30_percentage_of_population_z_score Age_30_40_percentage_of_population_z_score Age_40_50_percentage_of_population_z_score Age_50_60_percentage_of_population_z_score Age_60_70_percentage_of_population_z_score Age_70_80_percentage_of_population_z_score Age_80_90_percentage_of_population_z_score Age_90_100_percentage_of_population_z_score
count 6.0 6.0 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 ... 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000
mean 1.0 0.0 0.833333 1853.333333 43.923333 42.850000 1.232035 -2.141004 1.012914 -0.124428 ... 0.534851 -1.889649 3.299118 1.499204 -1.658795 -2.031060 -1.714512 -1.560361 -1.118900 -0.762661
std 0.0 0.0 0.408248 531.666688 15.515554 2.599808 0.735850 0.537459 0.940701 0.662983 ... 0.373668 0.596105 1.612865 0.834170 1.522433 0.634280 0.322764 0.351759 0.340646 0.666719
min 1.0 0.0 0.000000 1099.000000 25.410000 37.600000 0.325751 -3.136548 -0.245260 -1.358135 ... 0.220747 -2.540829 1.883770 0.229859 -3.220351 -3.040203 -2.319133 -2.227287 -1.670270 -1.137504
25% 1.0 0.0 1.000000 1600.750000 36.100000 43.450000 0.604102 -2.270256 0.280931 -0.099844 ... 0.220747 -2.441202 1.883770 1.034531 -2.973539 -2.404688 -1.691173 -1.606369 -1.309689 -1.056616
50% 1.0 0.0 1.000000 1954.000000 44.530000 43.900000 1.408622 -1.896378 1.574108 -0.099844 ... 0.407835 -1.783708 2.927172 1.680531 -1.600012 -1.762912 -1.638799 -1.428563 -0.971444 -0.983086
75% 1.0 0.0 1.000000 1954.000000 44.530000 43.900000 1.878356 -1.780271 1.574108 0.298768 ... 0.829357 -1.355936 4.597087 2.205122 -0.271866 -1.538295 -1.638799 -1.323559 -0.857835 -0.983086
max 1.0 0.0 1.000000 2676.000000 71.250000 44.500000 1.878356 -1.780271 1.750548 0.479459 ... 1.044438 -1.355936 5.366900 2.205122 -0.271866 -1.538295 -1.342913 -1.323559 -0.857835 0.591918

8 rows × 82 columns

In [34]:
df_data_stn_50 = filter_density(df_data_stn, min_density, max_density)
df_data_stn_50.describe()
Out[34]:
BrewDog Binomial_Artificial_Data Starbucks Area_Population_Density_All_usual_residents_population_count Area_Population_Density_Area_Hectares Area_Population_Density_Density_number_of_persons_per_hectare Economic_Activity_Economically_active_Total_percentage_of_population_z_score Economic_Activity_Economically_active_Employee_Part_time_percentage_of_population_z_score Economic_Activity_Economically_active_Employee_Full_time_percentage_of_population_z_score Economic_Activity_Economically_active_Self_employed_with_employees_Part_time_percentage_of_population_z_score ... General_Health_Very_bad_health_percentage_of_population_z_score Age_0_to_17_percentage_of_population_z_score Age_18_to_30_percentage_of_population_z_score Age_30_40_percentage_of_population_z_score Age_40_50_percentage_of_population_z_score Age_50_60_percentage_of_population_z_score Age_60_70_percentage_of_population_z_score Age_70_80_percentage_of_population_z_score Age_80_90_percentage_of_population_z_score Age_90_100_percentage_of_population_z_score
count 3307.000000 3307.000000 3307.000000 3307.000000 3307.000000 3307.000000 3307.000000 3307.000000 3307.000000 3307.000000 ... 3307.000000 3307.000000 3307.000000 3307.000000 3307.000000 3307.000000 3307.000000 3307.000000 3307.000000 3307.000000
mean 0.001814 0.000605 0.018748 1580.698216 38.953006 40.728062 -0.041399 0.136941 0.063034 -0.248017 ... -0.079570 -0.009580 -0.002321 -0.050212 -0.002937 -0.034921 -0.041711 0.060939 0.139821 0.087634
std 0.042563 0.024589 0.135655 281.560343 7.356273 2.371161 0.959121 0.971795 0.919722 0.806693 ... 1.052863 0.938281 0.928207 0.781248 0.961663 0.867152 0.813217 0.940963 1.017579 1.050113
min 0.000000 0.000000 0.000000 1002.000000 22.880000 36.700000 -5.183032 -4.537152 -3.670813 -1.358135 ... -6.060955 -3.802526 -1.303693 -2.358611 -5.144761 -4.017157 -2.578414 -2.285248 -1.795276 -1.137504
25% 0.000000 0.000000 0.000000 1433.000000 34.715000 38.700000 -0.617056 -0.402534 -0.477271 -0.780975 ... -0.643277 -0.557712 -0.456334 -0.563082 -0.503008 -0.538227 -0.571791 -0.560374 -0.577600 -0.605597
50% 0.000000 0.000000 0.000000 1541.000000 37.850000 40.700000 -0.024582 0.211558 0.041065 -0.383749 ... 0.128230 -0.022233 -0.168589 -0.152547 -0.001626 -0.023225 -0.061855 -0.018836 0.010315 -0.164868
75% 0.000000 0.000000 0.000000 1671.000000 41.570000 42.800000 0.549661 0.755140 0.581111 0.162750 ... 0.685437 0.561117 0.172091 0.320405 0.548990 0.508015 0.485172 0.582436 0.680074 0.454564
max 1.000000 1.000000 1.000000 5254.000000 124.410000 44.800000 5.262861 3.134690 5.773073 5.120248 ... 1.717757 3.764599 8.330041 5.123712 4.888453 3.675726 3.275352 4.509697 9.548668 8.162492

8 rows × 82 columns

In [35]:
visualize_population_density(df_data_stn_brewdog_50)
In [36]:
visualize_population_density(df_data_stn_50)

Population Density Filtered Correlation Analysis

In [37]:
correlation_analysis(df_data_stn_50)

Remove Highly Correlated Indicators

In this section we remove indicators that are highly correlated. By removing highly correlated indicators we do not lose much in terms of signal but help to reduce the complexity of our model.

In [38]:
# This function takes a dataframe and and a threshold and filters columns that have 
# a correlation that exceeds the threshold.
def correlation_filtration(df_original, threshold):
  df = df_original.copy(deep=True)
  # Set of all the names of deleted columns
  cols_to_delete = set() 

  # filtration
  correlation_matrix = df.corr()
  for i in range(len(correlation_matrix.columns)):
      for j in range(i):
          if correlation_matrix.iloc[i, j] >= threshold:
              column_name = correlation_matrix.columns[i] 
              cols_to_delete.add(column_name)
              if column_name in df.columns:
                  del df[column_name] 

  return list(cols_to_delete)

List of Filtered Indicators

In [39]:
list_columns_to_delete = correlation_filtration(df_data_stn, 0.9)
print list_columns_to_delete
['Ethnic_Group_White_percentage_of_population_z_score', 'Economic_Activity_Economically_active_Employee_Full_time_percentage_of_population_z_score', 'Hours_Worked_Full_time_31_to_48_hours_worked_percentage_of_population_z_score', 'Sex_Females_percentage_of_population_z_score']

Comparing Filtered Dataframe to Unfiltered

In [40]:
def filter_columns(df_original, columns):
  df = df_original.copy(deep=True)
  return df.drop(columns=columns)

All LSOAs Transformed and Filtered Dataframe

In [41]:
df_data_stn_corr = filter_columns(df_data_stn, list_columns_to_delete)
df_data_stn_corr.describe()
Out[41]:
BrewDog Binomial_Artificial_Data Starbucks Area_Population_Density_All_usual_residents_population_count Area_Population_Density_Area_Hectares Area_Population_Density_Density_number_of_persons_per_hectare Economic_Activity_Economically_active_Total_percentage_of_population_z_score Economic_Activity_Economically_active_Employee_Part_time_percentage_of_population_z_score Economic_Activity_Economically_active_Self_employed_with_employees_Part_time_percentage_of_population_z_score Economic_Activity_Economically_active_Unemployed_percentage_of_population_z_score ... General_Health_Very_bad_health_percentage_of_population_z_score Age_0_to_17_percentage_of_population_z_score Age_18_to_30_percentage_of_population_z_score Age_30_40_percentage_of_population_z_score Age_40_50_percentage_of_population_z_score Age_50_60_percentage_of_population_z_score Age_60_70_percentage_of_population_z_score Age_70_80_percentage_of_population_z_score Age_80_90_percentage_of_population_z_score Age_90_100_percentage_of_population_z_score
count 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 ... 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04
mean 0.000802 0.000659 0.021509 1614.092880 436.157968 41.453110 1.825400e-16 7.509173e-16 -2.035006e-19 -2.136756e-16 ... -5.636967e-17 1.106535e-16 2.485760e-16 -1.023354e-16 4.424103e-16 1.748070e-16 -3.652836e-17 -4.126992e-16 -2.623123e-16 9.076127e-17
std 0.028307 0.025658 0.145075 306.433578 1471.873554 41.760838 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 ... 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00
min 0.000000 0.000000 0.000000 983.000000 1.830000 0.000000 -5.413433e+00 -4.864558e+00 -1.358135e+00 -1.724057e+00 ... -1.191311e+01 -3.915913e+00 -1.578829e+00 -2.534894e+00 -5.734456e+00 -4.349254e+00 -2.665573e+00 -2.285248e+00 -1.795276e+00 -1.137504e+00
25% 0.000000 0.000000 0.000000 1434.000000 27.730000 11.200000 -6.010763e-01 -5.271023e-01 -7.076847e-01 -7.571892e-01 ... -5.137678e-01 -5.836344e-01 -5.751411e-01 -6.773256e-01 -5.389705e-01 -6.214417e-01 -7.284476e-01 -7.199529e-01 -7.174496e-01 -6.657508e-01
50% 0.000000 0.000000 0.000000 1563.000000 47.310000 33.300000 1.040814e-03 7.938616e-02 -2.065185e-01 -2.706429e-01 ... 1.974068e-01 -4.986110e-02 -1.830025e-01 -1.527924e-01 4.379444e-02 3.858206e-02 -4.556616e-02 -8.575912e-02 -1.313755e-01 -2.662219e-01
75% 0.000000 0.000000 0.000000 1735.000000 142.255000 56.100000 5.865969e-01 6.430684e-01 4.881986e-01 5.647153e-01 ... 7.205289e-01 5.612153e-01 2.541142e-01 4.815028e-01 6.090181e-01 6.758942e-01 6.940035e-01 6.154506e-01 5.380000e-01 3.544514e-01
max 1.000000 1.000000 1.000000 8300.000000 67283.520000 684.700000 6.139788e+00 5.181914e+00 8.396627e+00 7.992347e+00 ... 1.802114e+00 5.627803e+00 9.078299e+00 7.179908e+00 4.888453e+00 4.481805e+00 4.448221e+00 7.044505e+00 9.548668e+00 1.156602e+01

8 rows × 78 columns

Correlation Visualization Post Filtration

In [42]:
correlation_analysis(df_data_stn_corr)

BrewDog LSOAs Transformed and Filtered Dataframe

In [43]:
df_data_stn_brewdog_corr = filter_columns(df_data_stn_brewdog, list_columns_to_delete)
df_data_stn_brewdog_corr.describe()
Out[43]:
BrewDog Binomial_Artificial_Data Starbucks Area_Population_Density_All_usual_residents_population_count Area_Population_Density_Area_Hectares Area_Population_Density_Density_number_of_persons_per_hectare Economic_Activity_Economically_active_Total_percentage_of_population_z_score Economic_Activity_Economically_active_Employee_Part_time_percentage_of_population_z_score Economic_Activity_Economically_active_Self_employed_with_employees_Part_time_percentage_of_population_z_score Economic_Activity_Economically_active_Unemployed_percentage_of_population_z_score ... General_Health_Very_bad_health_percentage_of_population_z_score Age_0_to_17_percentage_of_population_z_score Age_18_to_30_percentage_of_population_z_score Age_30_40_percentage_of_population_z_score Age_40_50_percentage_of_population_z_score Age_50_60_percentage_of_population_z_score Age_60_70_percentage_of_population_z_score Age_70_80_percentage_of_population_z_score Age_80_90_percentage_of_population_z_score Age_90_100_percentage_of_population_z_score
count 28.0 28.0 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 ... 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000
mean 1.0 0.0 0.607143 1539.750000 32.313214 69.035714 1.198381 -2.685781 0.218533 0.081076 ... 0.626037 -2.707890 4.313741 1.490576 -2.553718 -2.437340 -1.859123 -1.685022 -1.185628 -0.720262
std 0.0 0.0 0.497347 468.517411 16.398344 59.246136 1.601177 1.100310 0.919229 0.746118 ... 0.877127 0.905927 2.173272 1.345553 1.623643 1.056217 0.662235 0.732901 1.046901 1.139231
min 1.0 0.0 0.000000 1020.000000 4.920000 18.900000 -1.612413 -4.441062 -1.358135 -1.341683 ... -2.655358 -3.720035 0.654564 -1.433397 -5.059481 -4.204706 -2.531818 -2.285248 -1.795276 -1.137504
25% 1.0 0.0 0.000000 1170.000000 22.020000 36.650000 0.159871 -3.712330 -0.244735 -0.486565 ... 0.220747 -3.472156 2.332244 0.867901 -3.760522 -3.252314 -2.393164 -2.174174 -1.735966 -1.137504
50% 1.0 0.0 1.000000 1380.000000 28.670000 45.200000 1.108527 -2.631764 0.131987 0.021143 ... 0.788591 -2.773914 5.066814 1.539856 -3.014265 -2.712457 -2.136397 -1.980944 -1.464853 -1.005345
75% 1.0 0.0 1.000000 1954.000000 44.837500 75.550000 2.001486 -1.897623 0.532454 0.579095 ... 1.203730 -2.047220 5.877150 1.928655 -0.985077 -1.538295 -1.567390 -1.323559 -1.035473 -0.852829
max 1.0 0.0 1.000000 2706.000000 71.250000 281.900000 4.671282 0.533221 2.016552 1.396014 ... 1.597519 -1.130466 8.212988 5.306175 0.444753 -0.080808 -0.005035 0.997771 3.744583 4.802126

8 rows × 78 columns

Factor Analysis

This section performs factor analysis using Principal Component Analysis to generate our composite measure.

Principal Component Analysis

This section performs PCA on our filtered BrewDog location dataset

Setup Data for PCA

In [44]:
# Create PCA Dataframe
def create_pca_dataframe(df_original, features):
  df = df_original.copy(deep=True)
  
  return pd.DataFrame(df, columns=features)
In [45]:
  # Remove non-features to make features list
  features = list(df_data_stn_brewdog_corr)
  features.remove('Binomial_Artificial_Data')
  features.remove('BrewDog')
  features.remove('geography_code')
  features.remove('Area_Population_Density_All_usual_residents_population_count')
  features.remove('Area_Population_Density_Area_Hectares')
  features.remove('Area_Population_Density_Density_number_of_persons_per_hectare')
In [46]:
df_pca = create_pca_dataframe(df_data_stn_brewdog_corr, features)
df_pca.describe()
Out[46]:
Starbucks Economic_Activity_Economically_active_Total_percentage_of_population_z_score Economic_Activity_Economically_active_Employee_Part_time_percentage_of_population_z_score Economic_Activity_Economically_active_Self_employed_with_employees_Part_time_percentage_of_population_z_score Economic_Activity_Economically_active_Unemployed_percentage_of_population_z_score Economic_Activity_Economically_active_Full_time_student_percentage_of_population_z_score Economic_Activity_Economically_inactive_Total_percentage_of_population_z_score Economic_Activity_Economically_inactive_Retired_percentage_of_population_z_score Economic_Activity_Economically_inactive_Student_including_full_time_students_percentage_of_population_z_score Economic_Activity_Economically_inactive_Long_term_sick_or_disabled_percentage_of_population_z_score ... General_Health_Very_bad_health_percentage_of_population_z_score Age_0_to_17_percentage_of_population_z_score Age_18_to_30_percentage_of_population_z_score Age_30_40_percentage_of_population_z_score Age_40_50_percentage_of_population_z_score Age_50_60_percentage_of_population_z_score Age_60_70_percentage_of_population_z_score Age_70_80_percentage_of_population_z_score Age_80_90_percentage_of_population_z_score Age_90_100_percentage_of_population_z_score
count 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 ... 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000 28.000000
mean 0.607143 1.198381 -2.685781 0.218533 0.081076 2.542628 -1.869497 1.731053 -4.014423 0.234525 ... 0.626037 -2.707890 4.313741 1.490576 -2.553718 -2.437340 -1.859123 -1.685022 -1.185628 -0.720262
std 0.497347 1.601177 1.100310 0.919229 0.746118 2.753864 2.222447 0.640839 3.171138 1.048237 ... 0.877127 0.905927 2.173272 1.345553 1.623643 1.056217 0.662235 0.732901 1.046901 1.139231
min 0.000000 -1.612413 -4.441062 -1.358135 -1.341683 -0.554555 -6.586953 -0.312652 -10.342756 -3.981119 ... -2.655358 -3.720035 0.654564 -1.433397 -5.059481 -4.204706 -2.531818 -2.285248 -1.795276 -1.137504
25% 0.000000 0.159871 -3.712330 -0.244735 -0.486565 0.472428 -3.041766 1.641859 -6.130919 -0.025681 ... 0.220747 -3.472156 2.332244 0.867901 -3.760522 -3.252314 -2.393164 -2.174174 -1.735966 -1.137504
50% 1.000000 1.108527 -2.631764 0.131987 0.021143 1.576139 -2.191203 1.996577 -3.520660 0.528186 ... 0.788591 -2.773914 5.066814 1.539856 -3.014265 -2.712457 -2.136397 -1.980944 -1.464853 -1.005345
75% 1.000000 2.001486 -1.897623 0.532454 0.579095 4.117219 -0.011862 2.093154 -0.888908 0.750605 ... 1.203730 -2.047220 5.877150 1.928655 -0.985077 -1.538295 -1.567390 -1.323559 -1.035473 -0.852829
max 1.000000 4.671282 0.533221 2.016552 1.396014 9.960997 2.520281 2.374492 0.358319 1.275420 ... 1.597519 -1.130466 8.212988 5.306175 0.444753 -0.080808 -0.005035 0.997771 3.744583 4.802126

8 rows × 73 columns

In [47]:
# Separating out the features
x = df_data_stn_brewdog_corr.loc[:, features].values
In [48]:
# Separating out the target
target = 'BrewDog'
y = df_data_stn_brewdog_corr.loc[:, [target]].values

Run PCA

In [49]:
n_components = 10
pca = PCA(n_components=n_components)
pca.fit(x)
Out[49]:
PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

Generate Skree Plots

In [50]:
# Plot the variance explained by each component
def visualize_variance(pca):
  variance = pca.explained_variance_ratio_
  plot.plot(variance)
  plot.title('PCA Ratio of Variance Explained by Components')
  plot.xlabel('number of components')
  plot.ylabel('ratio of variance explained')
In [51]:
visualize_variance(pca)
In [52]:
# Plot the cumalitve vriance explained by the components
def visualize_cumulative_variance(pca):
  variance = pca.explained_variance_ratio_
  cumulative_variance = np.cumsum(np.round(variance, decimals=4) * 100)
  plot.plot(cumulative_variance)
  plot.title('PCA Cumulative Variance Explained by Components')
  plot.xlabel('number of components')
  plot.ylabel('cumulative explained variance')
In [53]:
visualize_cumulative_variance(pca)

Factor Loadings from PCA

In this section we generate the factor loadings from the PCA components using varimax rotation.

In [54]:
# Run PCA with the desired number of components
factors_to_generate = 5
factor_pca = PCA(n_components=factors_to_generate)
factor_pca.fit(x)
Out[54]:
PCA(copy=True, iterated_power='auto', n_components=5, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
In [55]:
# Create Dataframe of factor loadings using a given number of componenents
def generate_factor_loadings(data, pca):
    
    factors = ['Factor {}'.format(i) for i in range(1,len(pca.components_)+1)]
    
    components = pd.DataFrame(np.round(pca.components_, 4), columns = data.keys()) 
    components.index = factors

    ratios = pca.explained_variance_ratio_.reshape(len(pca.components_), 1) 
    variance_ratios = pd.DataFrame(np.round(ratios, 4), columns = ['Explained Variance']) 
    variance_ratios.index = factors

    return pd.concat([variance_ratios, components], axis = 1).transpose()
In [56]:
factor_loadings = generate_factor_loadings(df_pca, factor_pca)
factor_loadings.head()
Out[56]:
Factor 1 Factor 2 Factor 3 Factor 4 Factor 5
Explained Variance 0.3600 0.1674 0.1042 0.0929 0.0713
Starbucks 0.0049 -0.0627 -0.0226 0.0055 -0.0050
Economic_Activity_Economically_active_Total_percentage_of_population_z_score 0.1335 -0.1621 0.2527 -0.0194 -0.0681
Economic_Activity_Economically_active_Employee_Part_time_percentage_of_population_z_score 0.1128 0.1092 -0.0269 -0.1002 -0.0520
Economic_Activity_Economically_active_Self_employed_with_employees_Part_time_percentage_of_population_z_score -0.0136 -0.1080 -0.0509 -0.0171 -0.1137

Squared factor loading (scaled to unity sum)

In this section we generate the squared factor loadings scaled to a unity sum. The result is a list of individual indicators with relative weightings.

In [57]:
def sqaure_factors(df):
  factor_names = list(df)
  
  # Sqaure the entries in each column
  for factor in factor_names:
    df[factor] = (df[factor] * df[factor])

  return df
In [58]:
def unit_scale_factors(df):
  factor_names = list(df)
  
  # Scale to sum to 1
  for factor in factor_names:
    df[factor] = (df[factor] / df[factor].mean()) / len(list(df[factor]))

  return df 
In [59]:
# Sqaure factor loadings and scale to unity sum
def square_scale_factors(df_original):
  df = df_original.copy(deep=True)
  
  # remove explained variance
  df = df.drop(['Explained Variance'])
  
  # square and scale
  df = sqaure_factors(df)
  df = unit_scale_factors(df)
  
  return df
In [60]:
factor_loadings_ss = square_scale_factors(factor_loadings)

Composite Measure

In this section we use our squared and scaled factor loadings to generate our composite measure.

Intermediate Composite Indicators

In this section we generate intermediate composite indicators. This is done by grouping the indicators with highest factor loadings into intermediate composite indicators.

In [61]:
# Determine the indicators to use from each factor by analyzing the sorted list of indicators 
# by loading for each factor
def identify_highest_loadings(df_original, factor):
  df = df_original.copy(deep=True)[factor]
  df = df.sort_values(ascending=False)
  return df
    
In [62]:
# Find the intermediate composite indicators for each factor
def find_intermediate_composite_indicators(df, loadings_per_factor):
  factors = list(df)
  highest_loadings_list = []
  for factor in factors:
    highest_loadings = identify_highest_loadings(df, factor)
    highest_loadings = highest_loadings.iloc[0:loadings_per_factor]
    highest_loadings_list.append(highest_loadings)
    
    print factor
    print '------------------'
    print highest_loadings
    print '------------------\n'
  
  return highest_loadings_list
In [63]:
intermediate_composite_indicators_list = find_intermediate_composite_indicators(factor_loadings_ss, 10)
Factor 1
------------------
Economic_Activity_Economically_inactive_Student_including_full_time_students_percentage_of_population_z_score    0.188712
Economic_Activity_Economically_active_Full_time_student_percentage_of_population_z_score                         0.089166
Economic_Activity_Economically_inactive_Total_percentage_of_population_z_score                                   0.083236
Age_18_to_30_percentage_of_population_z_score                                                                    0.082660
Religion_Buddhist_percentage_of_population_z_score                                                               0.059832
Age_40_50_percentage_of_population_z_score                                                                       0.036751
Method_of_Travel_to_Work_Underground_metro_light_rail_tram_percentage_of_population_z_score                      0.030347
Hours_Worked_Part_time_15_hours_or_less_worked_percentage_of_population_z_score                                  0.027061
Ethnic_Group_Other_percentage_of_population_z_score                                                              0.025473
Industry_M_Professional_scientific_and_technical_activities_percentage_of_population_z_score                     0.023257
Name: Factor 1, dtype: float64
------------------

Factor 2
------------------
Religion_Buddhist_percentage_of_population_z_score                                              0.166787
Sex_Males_percentage_of_population_z_score                                                      0.084621
Industry_M_Professional_scientific_and_technical_activities_percentage_of_population_z_score    0.082712
Hours_Worked_Part_time_15_hours_or_less_worked_percentage_of_population_z_score                 0.070968
Economic_Activity_Economically_active_Full_time_student_percentage_of_population_z_score        0.056310
Hours_Worked_Full_time_49_or_more_hours_worked_percentage_of_population_z_score                 0.046915
Industry_K_Financial_and_insurance_activities_percentage_of_population_z_score                  0.036251
Method_of_Travel_to_Work_Train_percentage_of_population_z_score                                 0.033379
Cars_No_cars_or_vans_in_household_percentage_of_population_z_score                              0.032797
Age_30_40_percentage_of_population_z_score                                                      0.032327
Name: Factor 2, dtype: float64
------------------

Factor 3
------------------
Religion_Buddhist_percentage_of_population_z_score                                                                  0.147144
Economic_Activity_Economically_active_Full_time_student_percentage_of_population_z_score                            0.075622
Method_of_Travel_to_Work_On_foot_percentage_of_population_z_score                                                   0.073114
Ethnic_Group_Other_percentage_of_population_z_score                                                                 0.069641
Economic_Activity_Economically_active_Total_percentage_of_population_z_score                                        0.063855
Distance_travelled_to_work_Less_than_2km_percentage_of_population_z_score                                           0.044562
Hours_Worked_Part_time_15_hours_or_less_worked_percentage_of_population_z_score                                     0.035606
Industry_G_Wholesale_and_retail_trade_repair_of_motor_vehicles_and_motor_cycles_percentage_of_population_z_score    0.034261
Economic_Activity_Economically_inactive_Total_percentage_of_population_z_score                                      0.032977
Industry_K_Financial_and_insurance_activities_percentage_of_population_z_score                                      0.027821
Name: Factor 3, dtype: float64
------------------

Factor 4
------------------
Method_of_Travel_to_Work_On_foot_percentage_of_population_z_score                                                0.172803
Sex_Males_percentage_of_population_z_score                                                                       0.143790
Distance_travelled_to_work_Less_than_2km_percentage_of_population_z_score                                        0.097031
Method_of_Travel_to_Work_Underground_metro_light_rail_tram_percentage_of_population_z_score                      0.041534
Economic_Activity_Economically_inactive_Student_including_full_time_students_percentage_of_population_z_score    0.039919
Method_of_Travel_to_Work_Train_percentage_of_population_z_score                                                  0.039124
Industry_M_Professional_scientific_and_technical_activities_percentage_of_population_z_score                     0.029756
Industry_I_Accommodation_and_food_service_activities_percentage_of_population_z_score                            0.029240
Economic_Activity_Economically_active_Full_time_student_percentage_of_population_z_score                         0.029070
Method_of_Travel_to_Work_Motorcycle_scooter_or_moped_percentage_of_population_z_score                            0.028628
Name: Factor 4, dtype: float64
------------------

Factor 5
------------------
Ethnic_Group_Other_percentage_of_population_z_score                                                              0.357726
Religion_Buddhist_percentage_of_population_z_score                                                               0.306807
Hours_Worked_Part_time_15_hours_or_less_worked_percentage_of_population_z_score                                  0.046311
Industry_I_Accommodation_and_food_service_activities_percentage_of_population_z_score                            0.033526
Industry_R_S_Arts_entertainment_and_recreation_other_service_activities_percentage_of_population_z_score         0.032689
Economic_Activity_Economically_active_Full_time_student_percentage_of_population_z_score                         0.022470
Religion_Muslim_percentage_of_population_z_score                                                                 0.013179
Economic_Activity_Economically_active_Self_employed_with_employees_Part_time_percentage_of_population_z_score    0.012928
Economic_Activity_Economically_active_Unemployed_percentage_of_population_z_score                                0.012882
Cars_No_cars_or_vans_in_household_percentage_of_population_z_score                                               0.012656
Name: Factor 5, dtype: float64
------------------

Calculate Weightings

From the list of intermediate composite indicators, we can use the explained variance of the factors they were generated as well as the squared and scaled factor loadings to compute the weightings for the indicators in the composite measure.

In [64]:
# Given an intermediate indicator list and the factors dataframe compute the weightings for the final composite measure
def calculate_weightings(intermediate_indicator_list, factors_df):
  # dictionary to track indicator weightings for composite measure
  indicator_weightings_dict = {}
  
  # iterate through factors and intermediate indicators calcualting weighting
  for i in range(0, factors_to_generate):
    # get the intermediate indicator
    intermediate_indicator = intermediate_indicator_list[i]
    
    # get the factor and its explained variance corresponding to the intermediate indicator
    factor = 'Factor ' + str((i + 1))
    explained_variance = factors_df[factor]['Explained Variance']
    
    for indicator, loading in intermediate_indicator.iteritems():
      # if the indicator is not already in the dictionary, add it
      if indicator not in indicator_weightings_dict:
        indicator_weightings_dict[indicator] = 0
      
      # get the indicator from the dictionary
      current_indicator_weight = indicator_weightings_dict[indicator]
      
      # add the intermediate indicator weight (product of factor loading and explained variance) to the current weight
      indicator_weightings_dict[indicator] = current_indicator_weight + (loading * explained_variance)
  
  return indicator_weightings_dict
    
In [65]:
composite_measure_indicator_weightings = calculate_weightings(intermediate_composite_indicators_list, factor_loadings)

Table of Indicators and Weights in Composite Measure

In [66]:
def visualize_weightings(weightings_dict):
  print "{:<150} {}".format('Indicator', 'Weight')
  print '----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------'
  for indicator in weightings_dict:
    print "{:<150}, {},".format(indicator, weightings_dict[indicator])
In [67]:
visualize_weightings(composite_measure_indicator_weightings)
Indicator                                                                                                                                              Weight
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Industry_K_Financial_and_insurance_activities_percentage_of_population_z_score                                                                        , 0.00896746642046,
Distance_travelled_to_work_Less_than_2km_percentage_of_population_z_score                                                                             , 0.0136574692573,
Ethnic_Group_Other_percentage_of_population_z_score                                                                                                   , 0.0419328295901,
Method_of_Travel_to_Work_On_foot_percentage_of_population_z_score                                                                                     , 0.0236718909228,
Religion_Muslim_percentage_of_population_z_score                                                                                                      , 0.000939671875992,
Age_40_50_percentage_of_population_z_score                                                                                                            , 0.013230191525,
Economic_Activity_Economically_active_Unemployed_percentage_of_population_z_score                                                                     , 0.000918510606576,
Hours_Worked_Full_time_49_or_more_hours_worked_percentage_of_population_z_score                                                                       , 0.00785350351157,
Age_30_40_percentage_of_population_z_score                                                                                                            , 0.00541160279579,
Method_of_Travel_to_Work_Train_percentage_of_population_z_score                                                                                       , 0.00922221304445,
Hours_Worked_Part_time_15_hours_or_less_worked_percentage_of_population_z_score                                                                       , 0.0286342805544,
Method_of_Travel_to_Work_Underground_metro_light_rail_tram_percentage_of_population_z_score                                                           , 0.0147834010953,
Religion_Buddhist_percentage_of_population_z_score                                                                                                    , 0.086667388617,
Sex_Males_percentage_of_population_z_score                                                                                                            , 0.0275236722014,
Industry_R_S_Arts_entertainment_and_recreation_other_service_activities_percentage_of_population_z_score                                              , 0.00233071571772,
Method_of_Travel_to_Work_Motorcycle_scooter_or_moped_percentage_of_population_z_score                                                                 , 0.00265955453932,
Economic_Activity_Economically_inactive_Student_including_full_time_students_percentage_of_population_z_score                                         , 0.0716447524019,
Cars_No_cars_or_vans_in_household_percentage_of_population_z_score                                                                                    , 0.00639253693955,
Economic_Activity_Economically_active_Self_employed_with_employees_Part_time_percentage_of_population_z_score                                         , 0.000921750500381,
Economic_Activity_Economically_active_Full_time_student_percentage_of_population_z_score                                                              , 0.0537086167931,
Age_18_to_30_percentage_of_population_z_score                                                                                                         , 0.0297575795687,
Industry_G_Wholesale_and_retail_trade_repair_of_motor_vehicles_and_motor_cycles_percentage_of_population_z_score                                      , 0.00356998077665,
Industry_M_Professional_scientific_and_technical_activities_percentage_of_population_z_score                                                          , 0.0249829309464,
Industry_I_Accommodation_and_food_service_activities_percentage_of_population_z_score                                                                 , 0.00510683387724,
Economic_Activity_Economically_inactive_Total_percentage_of_population_z_score                                                                        , 0.0334011986407,
Economic_Activity_Economically_active_Total_percentage_of_population_z_score                                                                          , 0.00665370472278,

Calculate Composite Measure for Each LSOA

In this section we use the indicator weightings to calculate the composite measure for each LSOA in the UK

In [68]:
# This function takes the UK master dataset and composite measure weightings to compute the a composite measure value for each LSOA
def calculate_composite_measure(df_original, weightings_dict):
  df = df_original.copy(deep=True)
  
  # initialize composite measure value to 0
  df['Composite_Measure'] = 0
  
  # calculate composite measure value
  for indicator in weightings_dict:
    weight = weightings_dict[indicator]
    df['Composite_Measure'] = df['Composite_Measure'] + (df[indicator] * weight)
    
  # Move composite measure to the front of dataframe
  column_names = list(df)
  column_names.insert(0, column_names.pop(column_names.index('Composite_Measure')))
  df = df[column_names]
  
  return df
In [69]:
df_data_stn_composite_measure = calculate_composite_measure(df_data_stn, composite_measure_indicator_weightings)
df_data_stn_composite_measure.describe()
Out[69]:
Composite_Measure BrewDog Binomial_Artificial_Data Starbucks Area_Population_Density_All_usual_residents_population_count Area_Population_Density_Area_Hectares Area_Population_Density_Density_number_of_persons_per_hectare Economic_Activity_Economically_active_Total_percentage_of_population_z_score Economic_Activity_Economically_active_Employee_Part_time_percentage_of_population_z_score Economic_Activity_Economically_active_Employee_Full_time_percentage_of_population_z_score ... General_Health_Very_bad_health_percentage_of_population_z_score Age_0_to_17_percentage_of_population_z_score Age_18_to_30_percentage_of_population_z_score Age_30_40_percentage_of_population_z_score Age_40_50_percentage_of_population_z_score Age_50_60_percentage_of_population_z_score Age_60_70_percentage_of_population_z_score Age_70_80_percentage_of_population_z_score Age_80_90_percentage_of_population_z_score Age_90_100_percentage_of_population_z_score
count 3.491600e+04 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 3.491600e+04 3.491600e+04 3.491600e+04 ... 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04
mean 4.354913e-17 0.000802 0.000659 0.021509 1614.092880 436.157968 41.453110 1.825400e-16 7.509173e-16 3.561261e-16 ... -5.636967e-17 1.106535e-16 2.485760e-16 -1.023354e-16 4.424103e-16 1.748070e-16 -3.652836e-17 -4.126992e-16 -2.623123e-16 9.076127e-17
std 1.401837e-01 0.028307 0.025658 0.145075 306.433578 1471.873554 41.760838 1.000014e+00 1.000014e+00 1.000014e+00 ... 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00
min -2.024047e+00 0.000000 0.000000 0.000000 983.000000 1.830000 0.000000 -5.413433e+00 -4.864558e+00 -3.805675e+00 ... -1.191311e+01 -3.915913e+00 -1.578829e+00 -2.534894e+00 -5.734456e+00 -4.349254e+00 -2.665573e+00 -2.285248e+00 -1.795276e+00 -1.137504e+00
25% -6.482864e-02 0.000000 0.000000 0.000000 1434.000000 27.730000 11.200000 -6.010763e-01 -5.271023e-01 -6.036662e-01 ... -5.137678e-01 -5.836344e-01 -5.751411e-01 -6.773256e-01 -5.389705e-01 -6.214417e-01 -7.284476e-01 -7.199529e-01 -7.174496e-01 -6.657508e-01
50% 9.559553e-03 0.000000 0.000000 0.000000 1563.000000 47.310000 33.300000 1.040814e-03 7.938616e-02 -4.509922e-02 ... 1.974068e-01 -4.986110e-02 -1.830025e-01 -1.527924e-01 4.379444e-02 3.858206e-02 -4.556616e-02 -8.575912e-02 -1.313755e-01 -2.662219e-01
75% 7.676183e-02 0.000000 0.000000 0.000000 1735.000000 142.255000 56.100000 5.865969e-01 6.430684e-01 5.471734e-01 ... 7.205289e-01 5.612153e-01 2.541142e-01 4.815028e-01 6.090181e-01 6.758942e-01 6.940035e-01 6.154506e-01 5.380000e-01 3.544514e-01
max 1.042890e+00 1.000000 1.000000 1.000000 8300.000000 67283.520000 684.700000 6.139788e+00 5.181914e+00 6.941278e+00 ... 1.802114e+00 5.627803e+00 9.078299e+00 7.179908e+00 4.888453e+00 4.481805e+00 4.448221e+00 7.044505e+00 9.548668e+00 1.156602e+01

8 rows × 83 columns

Identify Top LSOAs for BrewDog based on Composite Measure

In this section we use the composite measure to create a list of the top predicted LSOAs for a BrewDog location.

In [70]:
df_data_stn_composite_measure_sorted = df_data_stn_composite_measure.sort_values(['Composite_Measure'], ascending=False)
df_data_stn_composite_measure_sorted.head(50)
Out[70]:
Composite_Measure BrewDog Binomial_Artificial_Data Starbucks geography_code Area_Population_Density_All_usual_residents_population_count Area_Population_Density_Area_Hectares Area_Population_Density_Density_number_of_persons_per_hectare Economic_Activity_Economically_active_Total_percentage_of_population_z_score Economic_Activity_Economically_active_Employee_Part_time_percentage_of_population_z_score ... General_Health_Very_bad_health_percentage_of_population_z_score Age_0_to_17_percentage_of_population_z_score Age_18_to_30_percentage_of_population_z_score Age_30_40_percentage_of_population_z_score Age_40_50_percentage_of_population_z_score Age_50_60_percentage_of_population_z_score Age_60_70_percentage_of_population_z_score Age_70_80_percentage_of_population_z_score Age_80_90_percentage_of_population_z_score Age_90_100_percentage_of_population_z_score
27471 1.042890 0 0 0 E01017137 1719 10.15 169.4 0.834121 -3.257851 ... 1.230166 -3.556356 8.149133 -1.857161 -5.003254 -3.749897 -2.461268 -2.062911 -1.315758 -0.961975
21622 0.992781 0 0 0 E01033660 1168 9.67 120.8 5.899779 -3.221865 ... 1.681862 -3.766309 5.890506 2.661056 -3.638467 -3.731044 -2.632182 -2.285248 -1.795276 -1.137504
17828 0.940229 0 0 0 E01033673 1101 6.44 171.0 6.139788 -3.476119 ... 1.546974 -3.787150 5.483666 3.225244 -3.413965 -3.627004 -2.607510 -2.256320 -1.795276 -1.000477
17082 0.931430 0 0 1 E01032739 1044 165.36 6.3 4.446800 -3.829034 ... 1.667579 -2.732665 2.209424 3.095907 0.189288 -1.189572 -1.133357 -1.949666 -1.707548 -0.992996
17081 0.931430 0 0 1 E01032739 1044 165.36 6.3 4.446800 -3.829034 ... 1.667579 -2.732665 2.209424 3.095907 0.189288 -1.189572 -1.133357 -1.949666 -1.707548 -0.992996
17080 0.931430 0 0 1 E01032739 1044 165.36 6.3 4.446800 -3.829034 ... 1.667579 -2.732665 2.209424 3.095907 0.189288 -1.189572 -1.133357 -1.949666 -1.707548 -0.992996
17079 0.931430 0 0 1 E01032739 1044 165.36 6.3 4.446800 -3.829034 ... 1.667579 -2.732665 2.209424 3.095907 0.189288 -1.189572 -1.133357 -1.949666 -1.707548 -0.992996
17078 0.931430 0 0 1 E01032739 1044 165.36 6.3 4.446800 -3.829034 ... 1.667579 -2.732665 2.209424 3.095907 0.189288 -1.189572 -1.133357 -1.949666 -1.707548 -0.992996
17077 0.931430 0 0 1 E01032739 1044 165.36 6.3 4.446800 -3.829034 ... 1.667579 -2.732665 2.209424 3.095907 0.189288 -1.189572 -1.133357 -1.949666 -1.707548 -0.992996
17076 0.931430 0 0 1 E01032739 1044 165.36 6.3 4.446800 -3.829034 ... 1.667579 -2.732665 2.209424 3.095907 0.189288 -1.189572 -1.133357 -1.949666 -1.707548 -0.992996
17073 0.931430 0 0 1 E01032739 1044 165.36 6.3 4.446800 -3.829034 ... 1.667579 -2.732665 2.209424 3.095907 0.189288 -1.189572 -1.133357 -1.949666 -1.707548 -0.992996
17075 0.931430 0 0 1 E01032739 1044 165.36 6.3 4.446800 -3.829034 ... 1.667579 -2.732665 2.209424 3.095907 0.189288 -1.189572 -1.133357 -1.949666 -1.707548 -0.992996
17074 0.931430 0 0 1 E01032739 1044 165.36 6.3 4.446800 -3.829034 ... 1.667579 -2.732665 2.209424 3.095907 0.189288 -1.189572 -1.133357 -1.949666 -1.707548 -0.992996
17072 0.931430 0 0 1 E01032739 1044 165.36 6.3 4.446800 -3.829034 ... 1.667579 -2.732665 2.209424 3.095907 0.189288 -1.189572 -1.133357 -1.949666 -1.707548 -0.992996
17071 0.931430 0 0 1 E01032739 1044 165.36 6.3 4.446800 -3.829034 ... 1.667579 -2.732665 2.209424 3.095907 0.189288 -1.189572 -1.133357 -1.949666 -1.707548 -0.992996
17083 0.931430 0 0 1 E01032739 1044 165.36 6.3 4.446800 -3.829034 ... 1.667579 -2.732665 2.209424 3.095907 0.189288 -1.189572 -1.133357 -1.949666 -1.707548 -0.992996
18049 0.868010 0 0 1 E01032740 1099 63.73 17.2 4.441179 -3.664813 ... 1.418708 -3.493567 2.083325 2.335774 0.781052 -0.633382 -0.589595 -1.386845 -1.253582 -0.862951
18050 0.868010 0 0 1 E01032740 1099 63.73 17.2 4.441179 -3.664813 ... 1.418708 -3.493567 2.083325 2.335774 0.781052 -0.633382 -0.589595 -1.386845 -1.253582 -0.862951
21015 0.859656 0 0 0 E01011435 1820 15.01 121.3 0.765667 -3.605707 ... 1.647768 -3.726185 8.115033 -2.068146 -5.154701 -3.804156 -2.339174 -1.812751 -1.040430 0.023009
28735 0.832637 0 0 0 E01031880 3902 974.22 4.0 3.515276 -1.923575 ... 1.622137 -0.309438 4.076312 1.035108 -2.570074 -3.421012 -2.480309 -2.097512 -1.713124 -1.021512
22308 0.829284 0 0 0 E01033684 1203 4.83 249.1 5.129348 -3.020419 ... 1.802114 -3.639929 6.901135 1.592228 -5.020226 -4.136009 -2.653653 -2.285248 -1.795276 -1.137504
2872 0.807719 0 0 0 E01033034 1025 13.15 77.9 5.419938 -2.874078 ... 1.802114 -3.612365 5.564169 2.543593 -4.025088 -3.013675 -2.483715 -2.160956 -1.705922 -0.843130
32240 0.796958 0 0 0 E01032834 1130 3.52 321.0 4.879994 -3.944458 ... 1.553522 -3.116120 0.944253 4.919946 2.024400 -0.925787 -1.466255 -1.665163 -1.633174 -1.137504
15949 0.795941 0 0 0 E01033659 1336 18.11 73.8 5.203270 -2.856850 ... 1.591853 -3.482322 5.276801 3.649361 -3.946654 -3.758349 -2.619189 -2.237569 -1.726722 -1.137504
13283 0.790559 0 0 0 E01033682 1261 16.96 74.4 5.004555 -2.457697 ... 1.356581 -3.705540 4.357414 2.685498 -2.832233 -1.510747 -1.881587 -1.982157 -1.686329 -1.137504
2606 0.775337 0 0 0 E01032945 3163 30.91 102.3 0.515530 -3.359997 ... 1.091628 -3.445593 7.819278 -1.943636 -4.856368 -3.537450 -2.119199 -1.892538 -1.476760 -0.899017
12985 0.773230 0 0 0 E01021683 2368 35.72 66.3 4.211291 -2.336747 ... 1.208980 -2.502816 3.356049 3.718276 -2.234985 -3.301979 -1.932968 -1.733794 -1.253793 -1.010083
23946 0.770748 0 0 0 E01033015 1167 36.95 31.6 5.151292 -3.174819 ... 1.681759 -3.438256 5.846112 2.342760 -4.027412 -3.546717 -2.569502 -2.203372 -1.795276 -1.137504
18490 0.768461 0 0 1 E01033708 1599 35.73 44.8 4.407811 -3.365226 ... 1.275081 -3.080448 3.451729 4.398356 -1.600776 -2.808910 -2.292910 -2.086062 -1.737998 -1.137504
17949 0.752474 0 0 0 E01033664 1119 25.91 43.2 4.976724 -3.176031 ... 1.425561 -3.575042 6.013893 2.323045 -4.161413 -3.702902 -2.521469 -2.228323 -1.795276 -1.137504
19411 0.750332 0 0 0 E01032640 1248 4.42 282.4 5.028402 -3.992363 ... 1.351940 -2.960031 1.254930 5.503676 0.891577 -1.566849 -1.560914 -1.927958 -1.685195 -1.137504
27777 0.720636 0 0 1 E01033124 1105 26.47 41.7 2.415946 -3.339596 ... 0.658144 -3.441785 5.910578 -0.381913 -3.348561 -2.853356 -1.769730 -1.708781 -0.883541 -0.181788
31365 0.720061 0 0 0 E01033018 1137 41.02 27.7 5.186068 -3.256798 ... 1.555053 -3.458197 5.823202 2.880100 -4.666271 -3.650647 -2.588180 -2.257236 -1.795276 -1.004815
31919 0.719273 0 0 1 E01033016 1017 76.11 13.4 5.330903 -2.543258 ... 1.111582 -3.549722 4.296366 3.816522 -3.152202 -2.616360 -1.979029 -2.191296 -1.750248 -1.137504
31917 0.719273 0 0 1 E01033016 1017 76.11 13.4 5.330903 -2.543258 ... 1.111582 -3.549722 4.296366 3.816522 -3.152202 -2.616360 -1.979029 -2.191296 -1.750248 -1.137504
31918 0.719273 0 0 1 E01033016 1017 76.11 13.4 5.330903 -2.543258 ... 1.111582 -3.549722 4.296366 3.816522 -3.152202 -2.616360 -1.979029 -2.191296 -1.750248 -1.137504
22878 0.716953 0 0 0 E01032638 1159 4.06 285.5 4.824832 -3.706291 ... 1.802114 -2.965126 1.003403 4.516077 1.570819 -0.426084 -1.181087 -1.928002 -1.716253 -1.137504
25554 0.706689 0 0 0 E01013348 1913 48.40 39.5 1.934177 -1.124479 ... 0.847643 -1.972382 4.166741 0.223610 -2.337520 -2.280984 -1.731261 -1.336246 -0.933497 -0.979776
18027 0.695207 0 0 0 E01033493 1301 3.62 359.4 5.191973 -4.006925 ... 1.478238 -2.910548 3.673682 5.044803 -3.378804 -3.274725 -2.318025 -1.991476 -1.760077 -1.137504
32702 0.692596 0 0 0 E01019964 1892 1906.36 1.0 2.843660 -0.631296 ... 1.059756 -1.928315 3.629424 -1.820497 -2.668640 -1.161993 -0.510089 -0.197838 -0.391444 0.297802
4183 0.681080 0 0 0 E01033500 1333 131.56 10.1 4.961594 -3.009624 ... 1.486013 -2.756032 3.814207 4.142049 -2.665811 -3.301107 -2.162038 -2.165782 -1.692214 -1.137504
13145 0.680857 0 0 0 E01033352 1253 54.94 22.8 3.465713 -1.596824 ... 0.681171 -1.630467 3.759371 2.289636 -2.844547 -2.833245 -1.954322 -1.929384 -1.612539 -1.017099
7063 0.667784 0 0 0 E01032911 1154 26.29 43.9 4.292776 -2.189923 ... 1.193560 -2.716253 4.930651 2.220115 -3.321135 -2.948886 -2.357158 -2.174850 -1.676228 -1.137504
962 0.660153 0 0 0 E01033033 1580 10.87 145.4 2.688206 -3.408218 ... 0.113109 -3.413183 5.945517 -0.166942 -4.277541 -2.722223 -1.625128 -1.519239 -1.070688 -0.564593
4882 0.656648 0 0 1 E01033667 1006 13.28 75.8 4.442076 -1.305949 ... 0.405950 -3.544526 3.819357 3.517102 -1.511774 -2.561613 -1.995654 -2.158609 -1.249026 -1.137504
1013 0.654170 0 0 0 E01032767 1584 13.43 117.9 4.408505 -3.746309 ... 1.270091 -3.058793 4.492034 3.025596 -3.077338 -2.839224 -2.166227 -2.084176 -1.708545 -1.042260
16523 0.651048 0 0 0 E01030132 1956 794.70 2.5 3.006203 -1.163310 ... 1.299466 -0.092460 3.510848 0.295366 -2.207900 -3.021413 -1.826734 -1.519940 -1.584568 -0.983243
25454 0.638487 1 0 0 E01033761 1372 12.49 109.8 3.451203 -2.345129 ... 1.392626 -3.716702 6.634853 1.496428 -4.486679 -3.800548 -2.531818 -2.238820 -1.728521 -1.137504
19767 0.638039 1 0 0 E01033011 1373 30.37 45.2 4.154878 -2.847039 ... 1.597519 -3.716926 6.071539 1.836499 -3.913747 -3.358390 -2.389933 -2.215657 -1.761923 -1.027623
28630 0.637418 0 0 0 E01013368 2246 17.62 127.5 1.972746 -0.551882 ... 0.926621 -1.810507 3.638968 -0.309881 -1.453018 -1.556800 -1.404478 -1.164973 -1.102046 -0.667305

50 rows × 84 columns

Write Composite Measure to Cloud Storage

In [71]:
def write_cloud_storage(df_original):
  df = df_original.copy(deep=True)
  columns = ['geography_code', 'Composite_Measure']
  df = df[columns]

  project = Context.default().project_id
  bucket_name = project + "-composite-measure"
  bucket_path = 'gs://' + bucket_name
  bucket_object_name = 'composite-measure-output.json'
  
  bucket = storage.Bucket(bucket_name)
  bucket.create()
  bucket_object = bucket.object(bucket_object_name)
  
  df_json = df.to_json(orient='index')
  
  bucket_object.write_stream(df_json, 'text/plain')
In [72]:
write_cloud_storage(df_data_stn_composite_measure_sorted)

Visualize Composite Measure Distribution

In [73]:
# Given a dataframe, this function visualizes the population density distribution of LSOAs
def visualize_composite_measure(df):
  df_fig, df_ax = plot.subplots()
  df.hist(column='Composite_Measure', 
                           bins=175, 
                           orientation='vertical',
                           ax=df_ax)

  plot.title('LSOA Composite Measure Distribution')
  plot.ylabel('Number of LSOAs')
  plot.xlabel('Composite Measure Value')
In [74]:
visualize_composite_measure(df_data_stn_composite_measure_sorted)
In [75]:
df_data_stn_composite_measure_sorted.describe()
Out[75]:
Composite_Measure BrewDog Binomial_Artificial_Data Starbucks Area_Population_Density_All_usual_residents_population_count Area_Population_Density_Area_Hectares Area_Population_Density_Density_number_of_persons_per_hectare Economic_Activity_Economically_active_Total_percentage_of_population_z_score Economic_Activity_Economically_active_Employee_Part_time_percentage_of_population_z_score Economic_Activity_Economically_active_Employee_Full_time_percentage_of_population_z_score ... General_Health_Very_bad_health_percentage_of_population_z_score Age_0_to_17_percentage_of_population_z_score Age_18_to_30_percentage_of_population_z_score Age_30_40_percentage_of_population_z_score Age_40_50_percentage_of_population_z_score Age_50_60_percentage_of_population_z_score Age_60_70_percentage_of_population_z_score Age_70_80_percentage_of_population_z_score Age_80_90_percentage_of_population_z_score Age_90_100_percentage_of_population_z_score
count 3.491600e+04 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 34916.000000 3.491600e+04 3.491600e+04 3.491600e+04 ... 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04 3.491600e+04
mean 3.581611e-17 0.000802 0.000659 0.021509 1614.092880 436.157968 41.453110 1.693125e-16 7.553943e-16 3.776971e-16 ... -7.814423e-17 1.139603e-16 2.474567e-16 -9.768029e-17 4.363053e-16 1.693125e-16 -4.558414e-17 -4.167693e-16 -2.669928e-16 9.116827e-17
std 1.401837e-01 0.028307 0.025658 0.145075 306.433578 1471.873554 41.760838 1.000014e+00 1.000014e+00 1.000014e+00 ... 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00
min -2.024047e+00 0.000000 0.000000 0.000000 983.000000 1.830000 0.000000 -5.413433e+00 -4.864558e+00 -3.805675e+00 ... -1.191311e+01 -3.915913e+00 -1.578829e+00 -2.534894e+00 -5.734456e+00 -4.349254e+00 -2.665573e+00 -2.285248e+00 -1.795276e+00 -1.137504e+00
25% -6.482864e-02 0.000000 0.000000 0.000000 1434.000000 27.730000 11.200000 -6.010763e-01 -5.271023e-01 -6.036662e-01 ... -5.137678e-01 -5.836344e-01 -5.751411e-01 -6.773256e-01 -5.389705e-01 -6.214417e-01 -7.284476e-01 -7.199529e-01 -7.174496e-01 -6.657508e-01
50% 9.559553e-03 0.000000 0.000000 0.000000 1563.000000 47.310000 33.300000 1.040814e-03 7.938616e-02 -4.509922e-02 ... 1.974068e-01 -4.986110e-02 -1.830025e-01 -1.527924e-01 4.379444e-02 3.858206e-02 -4.556616e-02 -8.575912e-02 -1.313755e-01 -2.662219e-01
75% 7.676183e-02 0.000000 0.000000 0.000000 1735.000000 142.255000 56.100000 5.865969e-01 6.430684e-01 5.471734e-01 ... 7.205289e-01 5.612153e-01 2.541142e-01 4.815028e-01 6.090181e-01 6.758942e-01 6.940035e-01 6.154506e-01 5.380000e-01 3.544514e-01
max 1.042890e+00 1.000000 1.000000 1.000000 8300.000000 67283.520000 684.700000 6.139788e+00 5.181914e+00 6.941278e+00 ... 1.802114e+00 5.627803e+00 9.078299e+00 7.179908e+00 4.888453e+00 4.481805e+00 4.448221e+00 7.044505e+00 9.548668e+00 1.156602e+01

8 rows × 83 columns